Script Set-up

Code chunks will be hidden in html output. Look at the rmarkdown file for setting-up.


Table of Contents


Section 1 - Evaluation of Street Dictionaries & Street Cleaning Function

Street Cleaning Function was editted by Chang on 7th February 2020

The following Changes were made:

  • The combined Street Dictionary had several issues regarding “STR”, “ST” and “STREET”. They have now been resolved.

The Street Dictionary (combined and full) for Manhattan and Brooklyn had several abnormalities. From manually checking, the issues that were identified include:

  • Inconsistent String Pattern
    • e.g. “4 W”, “W 4”
    • e.g. “E 97” (Abnormal spacing in between words)
  • “ST” seems to have been unintentionally removed from several street names
    • e.g. “CHENUT”, “CHENUT ST”, “HOUON E”, " W HOUON ST“,”FIR AVE“,”EA END AVE“,”CHRIIE“,”UYVESANT ST“,”CHRISOPHER"
  • Possible Misspellings
    • e.g. “WILLET”, “WILLETT”, “GOERICK”, “GOERCK”, “EDGECOMBE”, “EDGECOMB”
  • Inconsistent Street Naming
    • e.g. “LEXINGTON AVE”, “LEX AVE”
  • Random Street Names
    • e.g. “B”, “C”, “WE ST”, “NYC HOME AGED AMP; INFIRM”, “NY ASYLUM SIERS ST DOMINIC”, “MANHATTAN ATE HOSPITAL WARDS ISLAND”, “HOME AGED LITTLE SIERS POOR”
  • Apostrophe S issues
    • e.g.“MARK\’S PL”, “RANDALL\’S ISLAND PARK”, “BEDLOE\’S ISLAND” (Apostrophe S)
  • some NA and NULL values were present

Section 2 - EDA of output from the Manhattan cleaning script

Manhattan’s Output

From the sample dataset, an output of 557,357 rows by 33 columns was derived.

There were 1,478 Enumeration Districts, 1,276 microfilms and 8,557 unique street names within the dataset.

Pre-Street-Dictionary Edit

The results from the Street Matching function were as follow:

  • 28% Perfect Match
  • 12% Identical Match
  • 47.5% Singular Mode
  • 3% Multiple Modes
  • 1.2% NAs
  • 0.5% No Match
Post-Street-Dictionary Edit (02/07/2020)

The results from the Street Matching function were as follow:

  • 34.7% Perfect Match
  • 25.4% Identical Match
  • 29.8% Singular Mode
  • 2% Multiple Modes
  • 0.8% NAs
  • 0.3% No Match

Significant improvements were observed in result types 1 and 2 after updating the Street Dictionary

  • 5% increase in perfect matches
  • 12% increase in identical matches
# Distribution of Result Type in Manhattan
plot(table(mn_output$result_type), 
     type = "h", 
     col = c("blue", "red", "orange", "purple", "green", "pink"), 
     lwd = 10,
     main = "Updated Result Type for Manhattan",
     ylab = "Count",
     xlab = "Result Type")

Problematic Result Types still exist (e.g. 5 and 6 result types)
# Get a list of NA and no matches
mn_56_result_list <- mn_output %>% filter(result_type == 5 | result_type == 6) %>% select(township, ED, street_add, best_match, result_type) %>% arrange(ED) %>% dplyr::distinct()
datatable(mn_56_result_list, options = list(autoWidth = TRUE)) %>%
  DT::formatStyle(columns = 1:5, fontSize = '70%')

Problematic EDs

Out of the problematic Street Matches (result type 5 and type 6), there is a trend of some EDs being more problematic than others, i.e. more entries of 5 or 6 within certain Enumeration Districts.

# Problematic EDs for Result Type 5
# Treshold set to 50 (arbitrarily decided)
mn_output %>% filter(result_type == 5) %>% 
  select(ED) %>% table() %>% sort(decreasing = TRUE) %>% head(19)
## .
## 0815 0009 0836 0043 0676 1272 0451 0818 0291 0049 0142 0363 1643 0766 1000 1202 
## 1432  305  226  213  165  153  113  112  101  100  100  100   89   85   78   74 
## 0316 1519 1679 
##   71   69   69
# Problematic EDs for Result Type 6
# Treshold set to 50 (arbitrarily decided)
mn_output %>% filter(result_type == 6) %>% 
  select(ED) %>% table() %>% sort(decreasing = TRUE) %>% head(8)
## .
## 1481 1913 0081 0184 0782 0242 1941 0202 
##  528  248  113   96   83   75   67   62
# Average result_type by ED
average_result_type_mn <- mn_output %>% 
  select(ED, result_type) %>%
  group_by(ED) %>%
  summarise(mean_result_type = mean(result_type))

average_result_type_mn$ED <- as.numeric(average_result_type_mn$ED)

mn_result_type_plot <- ggplot(average_result_type_mn, 
                              aes(x = ED, y = mean_result_type)) +
  theme_classic() +
  geom_point() + 
  labs(x = "ED", y = "Average Result Type", title = "Mean Result Type by ED (MN)"
  ) +
  geom_vline(xintercept = 1471, color = "red", linetype = "dotted") +
  geom_vline(xintercept = 1652, color = "red", linetype = "dotted")

ggplotly(mn_result_type_plot)
# Standard Deviation of result_type by ED
sd_result_type_mn <- mn_output %>% 
  select(ED, result_type) %>%
  group_by(ED) %>%
  summarise(sd_result_type = sd(result_type))

sd_result_type_mn$ED <- as.numeric(sd_result_type_mn$ED)

mn_sd_result_type_plot <- ggplot(sd_result_type_mn, 
                              aes(x = ED, y = sd_result_type)) +
  theme_classic() +
  geom_point() + 
  labs(x = "ED", y = "Standard Deviation of Result Type", title = "Standard Deviation of Result Type by ED (MN)"
  ) +
  geom_vline(xintercept = 1394, color = "red", linetype = "dotted") +
  geom_vline(xintercept = 1652, color = "red", linetype = "dotted")

ggplotly(mn_sd_result_type_plot)
# Map for Result Type in Manhattan
tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(mn_map) + tm_polygons("result_type")
## Linking to GEOS 3.7.2, GDAL 2.4.2, PROJ 5.2.0

Fill Down Function - Manhattan

Out of the 28% of Perfect Matches (156k entries) 1,717 entries were matched via the Fill Down function 03_Matched_Street_Fill_Down - 1.1%.

Flagged House Numbers - Manhattan

Out of 557,357 entries, 159,669 entries were flagged for house number changes (i.e. 0 or 1) - 28.6%.

Previously, 7,350 entries were flagged with “1” meaning the house number was editted by the function. That has now grown to 7,520 (post street cleaning function edits).

This flag refers to: The initial household number might have been “195-7”, “34 TO 36” or “112 114”. There would be a split between the house numbers and hn_1 will be 195 and hn_2 will be 197 as in the first example.

# Problematic Enumeration Districts for House Number Cleaning
mn_output %>% filter(flag_hn_cleaned == 1) %>% 
  select(ED) %>% table() %>% sort(decreasing = TRUE) %>% head(24)
## .
## 0190 0984 0805 0843 0094 1267 0967 0212 0178 0024 0586 0924 0403 1271 1167 1269 
##  634  289  257  245  191  187  177  173  154  143  132  120  102  102   92   85 
## 0256 0435 1073 0509 1036 0119 0681 1170 
##   82   82   81   79   74   71   63   63

Out of all the house hold number entries, 364,730 entries were filled in via the Fill Down function 05_House_Number_Fill_Down - 65.4%.

That has increased to 368,948 - 66.1% after the street cleaning function was editted.


Section 3 - EDA of output from the Brooklyn cleaning Script

Brooklyn Output

From the sample dataset, an output of 371,833 rows by 33 columns was derived.

There were 1,106 Enumeration Districts, 1,527 microfilms and 7,076 unique street names within the dataset.

Pre-Street-Dictionary Update

The results from the Street Matching function are as follow:

  • 76.5% Perfect Match
  • 7.9% Identical Match
  • 9.5% Singular Mode
  • 1.5% Multiple Modes
  • 1.3% NAs
  • 0.8% No Match
Post-Street-Dictionary Update
  • 78.7% Perfect Match
  • 8.3% Identical Match
  • 7.3% Singular Mode
  • 1.5% Multiple Modes
  • 1.0% NAs
  • 0.7% No Match

In contrast to Manhattan, there is a larger proportion of Perfect Matches.

# Distribution of Result type in Brooklyn
plot(table(bk_output$result_type), 
     type = "h", 
     col = c("blue", "red", "orange", "purple", "green", "pink"), 
     lwd = 10,
     main = "Result Type for Brooklyn",
     ylab = "Count",
     xlab = "Result Type")

# Get a list of NA and no matches
bk_56_result_list <- bk_output %>% filter(result_type == 5 | result_type == 6) %>% select(township, ED, street_add, best_match, result_type) %>% arrange(ED) %>% dplyr::distinct()
bk_56_result_list

Problematic EDs - Brooklyn

Out of the problematic Street Matches (result type 5 and type 6), there is a trend of some EDs being more problematic than others, i.e. more entries of 5 or 6 within certain Enumeration Districts.

# Problematic EDs for Result Type 5
# Treshold set to 50 (arbitrarily decided)
bk_output %>% filter(result_type == 5) %>% 
  select(ED) %>% table() %>% sort(decreasing = TRUE) %>% head(18)
## .
## 0915 1007 0556 1038 0491 0570 0562 0194 0905 1028 0580 1026 0807 1408 0935 0154 
##  252  117  104   96   90   79   74   63   63   62   61   61   59   57   55   54 
## 0469 0365 
##   53   52
# Problematic EDs for Result Type 6
# Treshold set to 50 (arbitrarily decided)
bk_output %>% filter(result_type == 6) %>% 
  select(ED) %>% table() %>% sort(decreasing = TRUE) %>% head(10)
## .
## 0161 0058 0253 0572 0909 0488 0733 0158 1073 0730 
##  100   88   83   78   75   60   56   55   51   46
# Average result_type by ED
average_result_type_bk <- bk_output %>% 
  select(ED, result_type) %>%
  group_by(ED) %>%
  summarise(mean_result_type = mean(result_type))

average_result_type_bk$ED <- as.numeric(average_result_type_bk$ED)

bk_result_type_plot <- ggplot(average_result_type_bk, 
                              aes(x = ED, y = mean_result_type)) +
  theme_classic() +
  geom_point() + 
  labs(x = "ED", y = "Average Result Type", title = "Mean Result Type by ED (BK)"
  ) +
  geom_vline(xintercept = 1103, color = "red", linetype = "dotted") +
  geom_vline(xintercept = 1406, color = "red", linetype = "dotted")

ggplotly(bk_result_type_plot)
# Standard Deviation of result_type by ED
sd_result_type_bk <- bk_output %>% 
  select(ED, result_type) %>%
  group_by(ED) %>%
  summarise(sd_result_type = sd(result_type))

sd_result_type_bk$ED <- as.numeric(sd_result_type_bk$ED)

bk_sd_result_type_plot <- ggplot(sd_result_type_bk, 
                              aes(x = ED, y = sd_result_type)) +
  theme_classic() +
  geom_point() + 
  labs(x = "ED", y = "Standard Deviation of Result Type", title = "Standard Deviation of Result Type by ED (BK)"
  ) +
  geom_vline(xintercept = 1103, color = "red", linetype = "dotted") +
  geom_vline(xintercept = 1406, color = "red", linetype = "dotted")

ggplotly(bk_sd_result_type_plot)
# Map for Result Type in Brooklyn
tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(bk_map) + tm_polygons("result_type")

Fill Down Function - Brooklyn

Out of the 76.5% of Perfect Matches (284k entries) 1,005 entries were matched via the Fill Down function 03_Matched_Street_Fill_Down - A much smaller percentage than Manhattan. This suggests that the recording of entries might be more accurate in Brooklynn or that the street directory is more well-developed. Nevertheless, errors are still present.

Flagged House Numbers - Brooklyn

Out of 371,833 entries, 179,303 entries were flagged for house number changes (i.e. 0 or 1) - 48.2%.

2,563 entries were flagged with “1” meaning the house number was editted by the function. Brooklyn also has several abnormal initial house numbers. E.g. “222 1/2”, “192 3TH”. Otherwise, the issues are similar to Manhattan.

# Problematic Enumeration Districts for House Number Cleaning
bk_output %>% filter(flag_hn_cleaned == 1) %>% 
  select(ED) %>% table() %>% sort(decreasing = TRUE) %>% head(11)
## .
## 0614 0822 0016 0013 0518 0456 0321 0239 0846 0029 0358 
##  335  295  159  149  116  113  108   93   89   86   86

Out of all the house hold number entries, 182,950 entries were filled in via the Fill Down function 05_House_Number_Fill_Down - 49.1%.


Section 4 - Other Issues

There are missing EDs in the Dataset

- There are missing EDs in the `sample` dataset and missing EDs in the `shapefiles`
- The EDs could be names wrongly
    - In the Shapefile, the max ED is 1480 but the max ED in the output runs till 1941
# Missing EDs in Manhattan

# Manhattan ED list
mn_ed_list <- mn_map@data$ED %>% sort()

mn_output_ed <- mn_output$ED %>% unique() %>% as.numeric() %>% sort()
# Which EDs are in the Shapefile but missing from the Manhattan dataset
mn_ed_list[!(mn_output_ed %in% mn_ed_list)]
## [1]   43   47 1431 1519 1742 1743 1745
length(mn_ed_list[!(mn_output_ed %in% mn_ed_list)])
## [1] 7
# Which EDs are in the dataset but missing from the Manhattan shapefile
mn_output_ed[!(mn_ed_list %in% mn_output_ed)]
## [1]  699 1070 1175 1712 1727 1737 1745 1747 1748
length(mn_output_ed[!(mn_ed_list %in% mn_output_ed)])
## [1] 9
# Missing EDs in Brooklyn

# Brooklyn ED list
bk_ed_list <- bk_map@data$ED %>% sort()

# Brooklyn Output EDs
bk_output_ed <- bk_output$ED %>% unique() %>% as.numeric() %>% sort()
# Which EDs are in the Shapefile but missing from the Brooklyn dataset
bk_ed_list[!(bk_output_ed %in% bk_ed_list)]
## [1] 512
length(bk_ed_list[!(bk_output_ed %in% bk_ed_list)])
## [1] 1
# Which EDs are in the dataset but missing from the Brooklyn shapefile
bk_output_ed[!(bk_ed_list %in% bk_output_ed)]
## [1]  761  950  958  963  964 1040 1086
length(bk_output_ed[!(bk_ed_list %in% bk_output_ed)])
## [1] 7


Problematic Matches - Manhattan

There are 39,311 entries with NA best_match. Their street_add entries are also NA. There are 795 EDs with these problems.

  • The original sample data might be corrupted. Or have missing street addresses
mn_output %>% filter(is.na(best_match)) %>% sample_n(5)
na_mn_output <- mn_output %>% filter(is.na(best_match)) %>% select(ED) %>% unique()

na_mn_output <- na_mn_output %>% mutate('NA best match' = 1)

datatable(na_mn_output, options = list(autoWidth = TRUE)) %>%
  DT::formatStyle(columns = 1:5, fontSize = '70%')
na_mn_map <- merge(mn_map, na_mn_output, 'ED', 'ED')
# Highlight Example
example <- mn_output %>% filter(ED == '0227')
example[41:60,]
# Map for NA Street Address / Best Match in Manhattan
tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(na_mn_map) + tm_polygons("NA best match")
  • Street Dictionary has “0”, which is deriving this issue.
  • There are 80 such streets affected
# Another problematic best match is "0"
mn_output %>% filter(best_match == 0) %>% sample_n(10)


Problematic Matches - Brooklyn

There are 9,264 entries with NA best_match. Their street_add entries are also NA. There are 427 EDs with these problems.

bk_output %>% filter(is.na(best_match)) %>% sample_n(5)
na_bk_output <- bk_output %>% filter(is.na(best_match)) %>% select(ED) %>% unique()

na_bk_output <- na_bk_output %>% mutate('NA best match' = 1)

datatable(na_bk_output, options = list(autoWidth = TRUE)) %>%
  DT::formatStyle(columns = 1:5, fontSize = '70%')
na_bk_map <- merge(bk_map, na_bk_output, 'ED', 'ED')
# Map for NA Street Address / Best Match in Brooklyn
tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(na_bk_map) + tm_polygons("NA best match")
  • Brooklyn has fewer of such “0” problems.
  • 11 unique streets are affected
# Another problematic best match is "0"
bk_output %>% filter(best_match == 0) %>% sample_n(10)

Appendix

Examples of the different result types

Perfect Match
# Sample of Result Type 1 (Perfect Match)
mn_output %>% filter(result_type == 1) %>% select(ED, street_add, best_match, result_type) %>% head(1)
Identical Match
# Sample of Result Type 2 (Identical Match)
mn_output %>% filter(result_type == 2) %>% select(ED, street_add, best_match, result_type) %>% head(1)
Singular Mode Match
# Sample of Result Type 3 (Singular Mode)
mn_output %>% filter(result_type == 3) %>% select(ED, street_add, best_match, result_type) %>% head(1)
Multiple Modes Match
# Sample of Result Type 4 (Multiple Modes)
mn_output %>% filter(result_type == 4) %>% select(ED, street_add, best_match, result_type) %>% head(1)
“NA” Match
# Sample of Result Type 1 (NA)
mn_output %>% filter(result_type == 5) %>% select(ED, street_add, best_match, result_type) %>% head(3)
No Match
# Sample of Result Type 6 (no match)
mn_output %>% filter(result_type == 6) %>% select(ED, street_add, best_match, result_type) %>% head(4)

Appendix - Trouble Shooting / Manual Checking

List of current issues: * There are “NA” result_type, which are not captured in our 5 and 6s * These entries typically have missing street_add * 38,237 such entries in Manhattan (6.86%) * 8,462 such entries in Brooklyn (2.27%)

  • There are entries that match to “0” in the street dictionary
    • These entries typically have result_type 4
    • 1,018 such entries in Manhattan (0.18%)
    • 1,032 such entries in Brooklyn (0.27%)
  • 5 and 6 matches

    • String matches reliably close but still produces a 6 output
      • EAST HOUSTON STREET –> E HOUSTON (result_type == 6)
      • CLINTON –> CLINTON AVE
      • BROOM STREET –> BROOME
      • DEKALB & AVENUE –> DE KALB AVE
      • Recommendation: Confident manual match
    • String matches part of the best match but still produces a 5 output
      • EAST 116 ST BEBRA FIRST AND PLED –> E 116
      • Recommendation: Confident manual match
    • Numerical String Unreliable match (produces either 5 or 6 result_type)
      • East 110 –> E 116
      • Recommendation: Investigate before manual match
    • Spelt out form numerical matches
      • EAST ONE HUNDRED & TWENTIETH ST –> E 119
      • EAST ONE HUNDRED AND THIRTEENTH –> E 113
      • EAST 99THTREET –> E 99
      • Recommendation: Update street dictionary to turn spelt out streets into numerical format. (Current function should already be doing this, especially the EAST to E). Confident manual match.
    • Completely random output with either 5 or 6 result_type
      • ?? –> E120
      • 444 –> 1 AVE
      • RIRIE TWNRD 117 ST 118 ST –> PLEASANT AVE
      • Recommendation: NULL
combine_56_result_list
  • Is it possible to use Hunspell’s spell check function to reduce the number of poor best matches?
    • There are 771 words that were returned from the hunspell function
    • There is a strong use case for certain streets
      • GREENRICH –> GREENWICH
      • CARRAL –> CANAL
      • COYAL –> CANAL
      • CHRYSTEE –> CHRYSTIE
      • ELDREDGE –> ELDRIDGE
      • NAPOLK ST –> NORFOLK
      • TEX –> LEX AVE
# Use hunspell to get a list of possible mispelt streets
misspelt <- hunspell_parse(combine_56_result_list$street_add) %>% unlist() %>% hunspell() %>% unlist()
misspelt_list <- misspelt[misspelt != "character(0)"]
misspelt_list %>% head(100)
##   [1] "COAVER"    "MASHIE"    "GRUMVIST"  "HASSAN"    "POILLION"  "GARE"     
##   [7] "BWAY"      "CLUFF"     "DAK"       "DODANE"    "DWAY"      "HAULL"    
##  [13] "MALLERY"   "NETHERY"   "REEN"      "MDW"       "STS"       "STIVER"   
##  [19] "STREETOR"  "COLER"     "RENNICK"   "REDERICK"  "STOMPSON"  "BWAY"     
##  [25] "GELL"      "STEWER"    "MAXAM"     "RAETH"     "DABY"      "STR"      
##  [31] "DEMER"     "LUCKTE"    "JARVIST"   "EIGELBACH" "FORSETH"   "ABRIGADOR"
##  [37] "JONAR"     "DWAIN"     "ESSEY"     "HESLER"    "STR"       "CHTYSTIE" 
##  [43] "CHRYSTEL"  "DELANCY"   "BROOMES"   "OLAMARY"   "BRANNAS"   "LAWTON"   
##  [49] "COLUMBAN"  "STAUTON"   "HOUTON"    "SEART"     "CASO"      "FIANT"    
##  [55] "ARELENE"   "FIANT"     "ARLANE"    "BEBRA"     "PLED"      "FITER"    
##  [61] "OSTER"     "DETRZ"     "SETREM"    "PLESAN"    "STRET"     "RIRIE"    
##  [67] "TWNRD"     "STR"       "LANDHAM"   "ELLO"      "THTREET"   "ROOP"     
##  [73] "SANDTS"    "WACT"      "SHIRD"     "LEXI"      "SEVE"      "HUNRED"   
##  [79] "HEMDRED"   "WOSTER"    "FW"        "FEIST"     "SECO"      "EIGHTTH"  
##  [85] "VASSE"     "COLUMB"    "HUNDRETH"  "TENETH"    "COURENT"   "MARTETTO" 
##  [91] "NICH"      "MCH"       "JUMEL"     "BRAKSWAY"  "DYCKMAN"   "DYCKMAN"  
##  [97] "TRYON"     "BWAY"      "NIDITO"    "ARY"